Latent Topic Model for Indexing Arabic Documents

Authors

Rami Ayadi

Mohsen Maraoui

Mounir Zrigui

Abstract

In this paper, the authors present a latent topic model for indexing and representing Arabic text documents in a way that captures more of their semantics. Text representation in a language with highly inflectional morphology, such as Arabic, is not a trivial task and requires special treatment. The authors describe their approach to analyzing and preprocessing Arabic text, followed by the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics: the authors extract the significant topics across all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. A classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different numbers of topics K (25, 50, 75, and 100), and the results are compared with classification in the full word space. The results show that classification in the reduced topic space outperforms both classification in the full word space and classification with LSI reduction, in terms of precision, recall, and F-measure.
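The idea of representing each text as a vector over latent topics can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes a plain collapsed Gibbs sampler for LDA, and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy transliterated tokens are all hypothetical choices made for illustration. The input is assumed to be already preprocessed and stemmed.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch only).

    docs: list of token lists, assumed to be already preprocessed and stemmed.
    Returns (theta, vocab, topic_word), where theta[d] is the topic vector
    of document d and topic_word[k][v] counts word v assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed document-topic proportions: the reduced representation.
    theta = [
        [
            (doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
            for k in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]
    return theta, vocab, topic_word


# Hypothetical toy corpus of transliterated stems (not from the paper's corpus).
toy_docs = [
    ["kitab", "maktaba", "qiraa", "kitab"],
    ["kura", "malab", "fariq", "kura"],
    ["kitab", "qiraa", "maktaba"],
    ["fariq", "malab", "kura"],
]
theta, vocab, topic_word = lda_gibbs(toy_docs, num_topics=2)
```

Each row of `theta` is a K-dimensional vector that could be fed to a classifier in place of the full word-space vector. In the setting the abstract describes, K would be 25, 50, 75, or 100 instead of the 2 used here.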

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable...

Big Data Categorization for Arabic Text Using Latent Semantic Indexing and Clustering

Document categorization is an important field in the area of natural language processing. In this paper, we propose using Latent Semantic Indexing (LSI), the singular value decomposition (SVD) method, and clustering techniques to group similar unlabeled documents into a pre-specified number of topics. The generated groups are then categorized using a suitable label. For clustering, we used Expectation–...

Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond

The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms or one term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, to identify and disambiguate term...

Semi-automatic Construction of Topic Ontologies

In this paper, we review two techniques for topic discovery in collections of text documents (Latent Semantic Indexing and K-Means clustering) and present how we integrated them into a system for semiautomatic topic ontology construction. The OntoGen system offers support to the user during the construction process by suggesting topics and analyzing them in real time. It suggests names for the ...

Journal:

IJIRR

Volume 4

Publication date: 2014
